from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
August 15, 2021
Recommendation systems are popular nowadays. They are used to predict the “rating” or “preference” that users would give to an item, and this information can then be used to provide users with useful suggestions. For example, Amazon uses them to suggest products to customers, while Netflix uses them to recommend videos based on each user’s preferences.

Generally, there are three types of recommendation systems:

1. Simple recommenders: provide recommendations based on items’ popularity or ratings, e.g., the movies in the IMDB Top 250.
2. Content-based recommenders: suggest items based on item properties. The system assumes that if a person likes a particular item, he or she will also like similar items. For example, Netflix suggests new movies based on the user’s viewing history.
3. Collaborative filtering engines: predict the rating or preference that a user would give an item based on the past ratings and preferences of other users.
In this post, we will build a content-based recommendation system for movies using the MovieLens Dataset. Since the full dataset is large (26 million ratings and 750,000 tag applications), we only use a subset of it for fast development.
You can download the dataset here.
import pandas as pd
metadata = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/dataset/archive/movies_metadata.csv", low_memory=False)
metadata.head(3)
adult | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | popularity | poster_path | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | video | vote_average | vote_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | {'id': 10194, 'name': 'Toy Story Collection', ... | 30000000 | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | http://toystory.disney.com/toy-story | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | 21.946943 | /rhIRbceoE9lR4veEXuwCC2wARtG.jpg | [{'name': 'Pixar Animation Studios', 'id': 3}] | [{'iso_3166_1': 'US', 'name': 'United States o... | 1995-10-30 | 373554033.0 | 81.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | NaN | Toy Story | False | 7.7 | 5415.0 |
1 | False | NaN | 65000000 | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | NaN | 8844 | tt0113497 | en | Jumanji | When siblings Judy and Peter discover an encha... | 17.015539 | /vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg | [{'name': 'TriStar Pictures', 'id': 559}, {'na... | [{'iso_3166_1': 'US', 'name': 'United States o... | 1995-12-15 | 262797249.0 | 104.0 | [{'iso_639_1': 'en', 'name': 'English'}, {'iso... | Released | Roll the dice and unleash the excitement! | Jumanji | False | 6.9 | 2413.0 |
2 | False | {'id': 119050, 'name': 'Grumpy Old Men Collect... | 0 | [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... | NaN | 15602 | tt0113228 | en | Grumpier Old Men | A family wedding reignites the ancient feud be... | 11.7129 | /6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg | [{'name': 'Warner Bros.', 'id': 6194}, {'name'... | [{'iso_3166_1': 'US', 'name': 'United States o... | 1995-12-22 | 0.0 | 101.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Still Yelling. Still Fighting. Still Ready for... | Grumpier Old Men | False | 6.5 | 92.0 |
Our recommendation system will be based on the similarity between movie overviews. Specifically, we will compute the pairwise cosine similarity scores for all movies and suggest movies based on these scores.

First of all, we have to transform the raw text into vector form, since we cannot compute similarity scores directly from raw text. In this post, we will compute the Term Frequency-Inverse Document Frequency (TF-IDF) vector for each document.
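The vectorization step can be sketched with scikit-learn's `TfidfVectorizer`. This is a minimal sketch on a toy corpus; in the post, the input would be the `overview` column of `metadata`, with missing values replaced first (e.g. `metadata['overview'].fillna('')`), since the vectorizer rejects NaN:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the movie overviews; the real input would be
# metadata['overview'].fillna('').
overviews = [
    "Led by Woody, Andy's toys live happily in his room.",
    "Siblings Judy and Peter discover an enchanted board game.",
    "A family wedding reignites an ancient feud between neighbors.",
]

# Strip common English stop words ("the", "a", ...) before weighting terms.
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(overviews)

# One row per document, one column per term in the vocabulary.
print(tfidf_matrix.shape)  # (3, vocabulary_size)
```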
From the shape of the matrix, we can see that each vector has length 75,827 (the vocabulary size) and that there are 45,466 movie overviews in total.
['avails',
'avaks',
'avalanche',
'avalanches',
'avallone',
'avalon',
'avant',
'avanthika',
'avanti',
'avaracious']
After generating a vector for each movie overview, we can start computing the similarity scores between them. Besides cosine similarity, there are many other options, such as the Manhattan distance, the Euclidean distance, the Pearson correlation, etc. There is no right or wrong answer as to which score is best: different scores work well in different situations, so it is always encouraged to experiment with different metrics and choose the best one.
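Wiring this up might look like the sketch below (shown on a toy corpus; the variable names `tfidf_matrix`, `cosine_sim`, and `indices` are assumptions matching the recommendation function that follows). Because TF-IDF rows are L2-normalized, `linear_kernel` (a plain dot product) gives the same result as `cosine_similarity` but runs faster:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy stand-in; in the post these would come from the metadata DataFrame.
titles = pd.Series(["Toy Story", "Jumanji", "Grumpier Old Men"])
overviews = [
    "Led by Woody, Andy's toys live happily in his room.",
    "Siblings Judy and Peter discover an enchanted board game.",
    "A family wedding reignites an ancient feud between neighbors.",
]

tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(overviews)

# TF-IDF rows are unit-length, so the dot product *is* the cosine similarity.
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Reverse lookup: movie title -> row index in the similarity matrix.
indices = pd.Series(titles.index, index=titles)

print(cosine_sim.shape)    # (3, 3)
print(indices["Jumanji"])  # 1
```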
def get_recommendations(title, cosine_sim=cosine_sim):
    # Look up the row index of the movie that matches the title.
    idx = indices[title]
    # Pair each movie index with its similarity score to this movie.
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies by similarity score, most similar first.
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Take the 10 most similar movies, skipping the first (the movie itself).
    sim_scores = sim_scores[1:11]
    # Return the titles of the top-10 most similar movies.
    movie_indices = [i[0] for i in sim_scores]
    return metadata['title'].iloc[movie_indices]
45464 Satan Triumphant
45463 Betrayal
45462 Century of Birthing
45461 Subdue
45460 Robin Hood
45459 Caged Heat 3000
45458 The Burkittsville 7
45457 Shadow of the Blair Witch
45456 House of Horrors
45455 St. Michael Had a Rooster
Name: title, dtype: object
1178 The Godfather: Part II
44030 The Godfather Trilogy: 1972-1990
1914 The Godfather: Part III
23126 Blood Ties
11297 Household Saints
34717 Start Liquidation
10821 Election
38030 A Mother Should Be Loved
17729 Short Sharp Shock
26293 Beck 28 - Familjen
Name: title, dtype: object
Here we will briefly discuss the motivation behind TF-IDF.
Term frequency: Given a set of English text documents, we want to rank them by how relevant each document is to a query, for example, “the excellent student”. First, we can simply filter out the documents that do not contain all 3 words: “the”, “excellent”, and “student”. However, there may still be many documents left. To further distinguish them, we can count the frequency of those 3 words in each document and rank the documents by those frequencies. That frequency is called the term frequency. Since document lengths may vary significantly, we often normalize the frequency of each word by the length of the document.
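As a small illustration (a hypothetical helper, not from the post), the normalized term frequency is just the word count divided by the document length:

```python
def term_frequency(term, document):
    """Normalized TF: occurrences of `term` divided by the document length."""
    words = document.lower().split()
    return words.count(term) / len(words)

doc = "the excellent student thanked the excellent teacher"
print(term_frequency("the", doc))        # 2/7
print(term_frequency("excellent", doc))  # 2/7
```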
Inverse document frequency: Some terms are more common than others. For example, the term “the” appears far more often than the word “excellent”. Term frequency therefore tends to incorrectly emphasize documents that happen to use the word “the” frequently, without giving enough weight to more meaningful terms such as “excellent” and “student” - yet “the” is not a good keyword for distinguishing relevant from non-relevant documents. The inverse document frequency diminishes the weight of terms that occur very frequently across the document set and increases the weight of terms that occur rarely.
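One common IDF variant takes the logarithm of the inverse fraction of documents containing the term, so a word appearing in every document gets weight zero. A self-contained sketch (the helper and corpus are hypothetical):

```python
import math

def inverse_document_frequency(term, documents):
    """IDF: log of (number of documents / number of documents containing term)."""
    containing = sum(1 for d in documents if term in d.lower().split())
    return math.log(len(documents) / containing)

docs = [
    "the excellent student",
    "the lazy dog",
    "the quick brown fox",
]
print(inverse_document_frequency("the", docs))        # log(3/3) = 0.0
print(inverse_document_frequency("excellent", docs))  # log(3/1) ≈ 1.0986
```

Multiplying a term's TF by its IDF yields the TF-IDF weight used to build the overview vectors above.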